The interlanguage links in Wikipedia connect pages on the same subject written in different 
languages. In theory, each connected component should be a clique and cover one topic. However, 
incoherent edits and obvious mistakes result in topic coalescence, yielding a non-trivial topology 
that is studied in this paper. We show that the component size distribution obeys the power law, 
and we explain anomalies in the distribution as results of certain edit conventions. Next, we propose 
a method of filtering out the cliques and study basic properties of the resulting skeleton, which turns 
out to be scale-free. 

PACS numbers: 89.75.Hc, 89.75.Da, 89.75.Fb 



In recent years Wikipedia has increasingly been a
subject of scientific study, both qualitative and
quantitative [15,]. Its content serves as an excellent
example of a large complex network which exhibits
exponential growth in the number of contributors and
text content. The growth has been described in terms of
the preferential attachment mechanism jSj, and the
dynamics of user contributions (e.g. conflict patterns)
have been thoroughly studied 0, 0].

In this paper we examine yet another, so far unde- 
scribed, facet of this network: the topology of the inter- 
language links. The analysis is based on database dumps 
retrieved on August 27, 2008. At that time, the interlan- 
guage links were defined as "links from any page (most 
notably articles) in one Wikipedia language to the same 
subject in another Wikipedia language" Q. Given this 
definition, the expected topology of the network is trivial: 
each subject should be represented by a separate, isolated 
clique consisting of all the pages on the subject, each 
clique should contain at most one page from any given 
language edition, and there should be no other links in 
the network. Mathematically speaking, the sum of all the
links should form an equivalence relation $\equiv$,
satisfying an additional condition:

$$a \equiv b \,\land\, a \neq b \implies \mathrm{lang}(a) \neq \mathrm{lang}(b) \qquad (1)$$

However, the software engine that powers Wikipedia
does not enforce coherence of the network: each page
maintains its own list of outgoing interlanguage links.
There are user-controlled programs, so-called bots, which
add the missing links by performing symmetric and
transitive closure. This means that for each link
$a \to b$ the bots add $b \to a$ if missing, and for each
pair of links $a \to b \to c$ they add $a \to c$ if
missing. The opposite problem of removing the extra links
is not trivial: while it is easy to detect a conflict
automatically, resolving it requires understanding of the
contents of the involved pages. For example, a simple
program traversing the network may discover the following
conflict (a real example, which has already been
corrected): en:Tap (valve) $\equiv$ it:Rubinetto $\equiv$
es:Grifo $\equiv$ en:Griffin. Here a program may raise a
flag since two pages in the same language are present in
one connected component, but a human will have to read
the articles to find the incorrect link(s).
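
The closure performed by the bots and the automatic
conflict check can be illustrated with a minimal Python
sketch. It is not the code used by any actual bot: the
function names (lang, components, closure, conflicts) and
the "language:Title" page format are assumptions made for
this example only.

```python
from collections import defaultdict


def lang(page):
    """Language prefix of a page name such as 'en:Griffin' (assumed format)."""
    return page.split(":", 1)[0]


def components(links):
    """Group pages into connected components, treating links as undirected."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for page in list(parent):
        groups[find(page)].add(page)
    return list(groups.values())


def closure(links):
    """Symmetric and transitive closure: all ordered pairs within a component."""
    return {(a, b)
            for comp in components(links)
            for a in comp for b in comp if a != b}


def conflicts(links):
    """Components containing two or more pages in the same language."""
    flagged = []
    for comp in components(links):
        langs = [lang(p) for p in comp]
        if len(langs) != len(set(langs)):
            flagged.append(comp)
    return flagged


links = [
    ("en:Tap (valve)", "it:Rubinetto"),
    ("it:Rubinetto", "es:Grifo"),
    ("es:Grifo", "en:Griffin"),
]
print(conflicts(links))
```

The sketch flags the whole component, which is all an
automatic check can do; deciding which of the three links
is wrong still requires reading the articles.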

Let us proceed to study the properties of two networks
of interlanguage links: one connecting the articles (A),
and the other connecting the categories (C). In both
cases we will treat the networks as undirected graphs,
assuming a link $a - b$ iff there is an interlanguage
link $a \to b$ or $b \to a$.

Network A consists of 11 510 142 nodes and 89 339 694 
links. Approx. 42% of the nodes are isolated, and the 
remaining nodes are grouped into 1 223 183 connected 
components. Network C consists of 1 724 088 nodes and 
13 902 852 links, approx. 51.5% of the nodes are isolated, 
and the rest are grouped into 118 039 connected compo- 
nents. 

We will say that a connected component is coherent 
when no two pages are in the same language [16], and 
that it is complete when it contains all the possible links 
(i.e., is a clique). There are 59 323 incoherent components 
in A and 6 152 incoherent components in C. In both cases 
it is approx. 5% of all the non-singleton connected com- 
ponents. Completeness is correlated with coherence: for 
example in the case of A, 63% of the coherent compo- 
nents are complete, and 99% contain at least half of all 
the possible links. On the other hand, none of the in- 
coherent components are complete, and only about 61% 
contain at least half of all the possible links. 
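
The two definitions translate directly into code. The
following Python sketch is only an illustration under
stated assumptions: pages are named "language:Title", a
component is given as a list of pages, and the helper
names (is_coherent, link_fraction, is_complete) are
hypothetical.

```python
def lang(page):
    """Language prefix of a page name such as 'en:Moon' (assumed format)."""
    return page.split(":", 1)[0]


def is_coherent(component):
    """True when no two pages of the component share a language."""
    langs = [lang(p) for p in component]
    return len(langs) == len(set(langs))


def link_fraction(component, links):
    """Fraction of all possible undirected links present in the component."""
    nodes = set(component)
    n = len(nodes)
    if n < 2:
        return 1.0
    present = {frozenset((a, b)) for a, b in links
               if a != b and a in nodes and b in nodes}
    return len(present) / (n * (n - 1) // 2)


def is_complete(component, links):
    """True when the component is a clique."""
    return link_fraction(component, links) == 1.0


triangle = ["en:Moon", "fr:Lune", "de:Mond"]
triangle_links = [("en:Moon", "fr:Lune"), ("fr:Lune", "de:Mond"),
                  ("de:Mond", "en:Moon")]

chain = ["en:Tap (valve)", "it:Rubinetto", "es:Grifo", "en:Griffin"]
chain_links = [("en:Tap (valve)", "it:Rubinetto"),
               ("it:Rubinetto", "es:Grifo"),
               ("es:Grifo", "en:Griffin")]

print(is_coherent(triangle), is_complete(triangle, triangle_links))
print(is_coherent(chain), link_fraction(chain, chain_links))
```

The "at least half of all the possible links" criterion
quoted above corresponds to link_fraction(...) >= 0.5 in
this sketch.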

Component size vs. rank plots for A and C are presented
in Figure 1. Values for the coherent and incoherent
components are plotted separately. For the incoherent
components of each network, the number of English
pages (a coarse measure of the number of topics) vs. rank 
is also shown. All the plots have logarithmic scales. 

The plotted points are (piecewise) well approximated
by straight lines, which indicates that a component's size
$s$ is a power-law function of its rank $r$, namely
$s \sim r^{-\gamma}$. Power-law distributions are
encountered in diverse settings: for example, the
distribution of city sizes 0], occurrences of DNA base
pair sequences [9], and the number of sent e-mails all
follow a power law. For an excellent description of the
distribution and numerous examples of its occurrences,
see Refs. [11] and [12].

FIG. 1: Component sizes for the article network (top row) and the category network (bottom row). Sizes of the coherent 
components (left column), the incoherent components (middle column) and the number of English pages in the incoherent 
components (right column) are all plotted against their ranks. Each plot has a log-log scale. The "king effect" 0] in the case 
of incoherent components and the influence of mass-produced date-related topics on the shapes of the coherent component 
distributions are visible (both features are discussed in the text). 

Let us take a closer look at the obtained distributions. 
In the case of the coherent components of A (top left
panel of Figure 1), there are two clear regimes. Most of
the top 2 200 or so components (the first regime), each
covered by at least 75 language editions, contain articles
on the years of the current era and centuries. Such
articles are easy to create in an automated way, and it is
easy to maintain the interlanguage links to corresponding 
articles in the other language editions (easy maintenance 
explains why the components are coherent). An informal 
competition among the language editions for the largest 
number of articles might be an additional motivation for 
the mass-creation of the date-related pages. 

Similarly, two regimes in the component size
distribution of C (bottom left panel of Figure 1) can also
be observed, although the transition between them is
smoother than in the previous case. Date-related topics
account for about 75% of the top 5 000-6 000 components
(each containing at least 26-28 nodes). Among these are categories
for years, decades, centuries, births and deaths in a given 
year, and (for the recent times) films and video games in 
a given year. Other prominent categories are: countries 
(including "History of ... " and "Geography of ... " as 
separate categories), and users speaking a given language 
on a given level. 

Moving on to the sizes of incoherent components, we
note that the largest in A (top middle panel in Figure 1)
is well above the best-fit line ($\gamma \approx 0.587$).
This anomaly is an example of the so-called "king effect",
discovered by Laherrere 0] while analyzing the sizes of
the world's oilfields: the largest element is much larger
than a log-log regression would predict. The same
component is a clear outlier in the distribution of the
number of English articles in components (top right panel
in Figure 1).

To give a perspective: the largest component consists
of 72 284 articles, including 3 184 articles in English,
while in theory each component should contain at most
one article in English, and its size should be bounded
by the number of language editions, i.e., about 250. It
contains articles on subjects as diverse as "Abelian
group", "Beekeeping", "Chinese poetry", and "Districts
of Luxembourg". The second largest component in terms
of the total number of articles contains 7 004 nodes, while
the second largest in terms of the number of English
articles contains 221 nodes.

Table I presents the parameters of the best fits
corresponding to the lines in Figure 1.
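
For illustration, an exponent and an adjusted
coefficient of determination can be obtained by ordinary
least squares on the logarithms of rank and size. The
Python sketch below assumes this particular estimator;
the text does not state which fitting procedure was
actually used, and the function name fit_rank_size is
hypothetical.

```python
import math


def fit_rank_size(sizes):
    """Least-squares fit of log(size) = c - gamma * log(rank).

    Returns (gamma, adjusted R^2). Needs at least three distinct sizes.
    """
    sizes = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r2 = sxy * sxy / (sxx * syy)
    # Adjusted R^2 for a model with one explanatory variable.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    return -slope, adjusted_r2


# Synthetic sizes that follow s = 1000 * r^(-0.5) exactly.
synthetic = [1000 * rank ** -0.5 for rank in range(1, 201)]
gamma, adjusted_r2 = fit_rank_size(synthetic)
print(round(gamma, 3), round(adjusted_r2, 4))
```

To reproduce a piecewise fit, one would call the function
separately on the sizes falling into each range. With such
a fit, an outlier like the largest incoherent component
shows up as a first-rank point lying far above the fitted
line.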

The left panel of Figure 2 presents the node degree
distribution in both networks (note that both axes are
logarithmic). The median node degree is 6 in A and 12
in C. There are two anomalies in the distribution in the
case of the article network: a plateau spanning degrees
74-93, and a peak at degrees 117-119. Both phenomena
have plausible explanations. The plateau results from
articles on the years of the current era. Some editions
contain articles on all the 2000+
TABLE I: The power law applied to the component sizes of 
the article and category networks. "C", "I" and "E" refer to 
the left, middle and right column in Figure 1 (respectively). 
It is tested whether the relation between a component size s 
and its rank r is indeed $s \sim r^{-\gamma}$. The last column 
denotes adjusted $R^2$, a measure of correlation. 

                  Range                                    Fit results
                  size               ranks                 $\gamma$   adjusted $R^2$
Articles (C)      $[75,\infty)$      $(-\infty,2173]$      0.120      0.9852
Articles (C)      $(-\infty,75)$     $(2173,\infty)$       0.544      0.9664
Articles (I)      $[15,\infty)$      $(-\infty,23\,463]$   0.587      0.9889
Articles (E)      $(-\infty,\infty)$ $(-\infty,\infty)$    0.469      0.9690
Categories (C)    $(-\infty,27]$     $[5\,638,\infty)$     0.196      0.9811
Categories (C)    $(27,\infty)$      $(-\infty,5\,638)$    0.970      0.9855
Categories (I)    $(-\infty,20]$     $[2\,472,\infty)$     0.493      0.9897




FIG. 2: (Left) Degree distribution for the article network 
(blue stars) and category network (red circles). Log-log scale, 
degree 0 omitted. The plateau at 74-93 and the peak at 
117-119 in the case of articles are commented on in the text. 
(Right) Degree distribution for the skeleton of the article 
network (log-log scale). The skeleton extraction procedure is 
described in the text. Only the incoherent components are 
accounted for, degree 0 is omitted. The distribution fits the 
power law with $\gamma \approx 3.75$ (adjusted $R^2 \approx 0.9770$). 
The peak at 27-30 is commented on in the text. 



years, others only on the more recent years. For
example, there are approx. 88 language editions covering
year 1709, 82 covering year 1209 and 74 covering year 509.
The articles on a given year usually form a clique,
thus each article has degree equal to one less than the
size of the clique. As a consequence, the increased
number of nodes with degrees 74-93 reflects a high
number of cliques of sizes 75-94. On the other hand,
the peak at degrees 117-119 relates to articles on days
of the year. There are approximately 120 language
editions where such articles are present, and these
editions usually contain articles on all the 366 days.
Most of the groups of articles are connected in cliques,
hence a peak in the degree distribution. The article with
the highest degree (337) is "ca:Llista de personatges de
la Mitologia Egipcia", which contains short descriptions
of various gods of the Egyptian mythology. A number of
articles in other languages, including 42 from the English
edition, contain interlanguage links to (or to redirects
to) this page.

FIG. 3: Distribution of the clustering coefficient values of 
nodes. Only the nodes with degree > 9 have been accounted 
for. The left diagram presents distributions for the article 
network, the right diagram presents distributions for the 
category network. Circles denote values for the incoherent 
components, triangles for the coherent ones. Note that the 
y-axis is logarithmic, so the vast majority of nodes have a 
clustering coefficient close to one. 

Watts and Strogatz [13] have demonstrated the
usefulness of an indicator named the clustering
coefficient in describing network topologies. The
clustering coefficient of a perfectly coherent and
complete network of interlanguage links would be 100%.
In reality, for both networks the value is quite high
(approx. 97%), with over 98% for the coherent
components, and approx. 91% for the incoherent ones.
Figure 3 presents the distribution of the values of the
clustering coefficient for nodes having at least 10
neighbors [13]. As expected, a low clustering coefficient
is fairly uncommon. More interesting is the conditional
probability that a node with degree > 9 and clustering
coefficient < 80% will be part of an incoherent
component: it is approximately equal to 99.58% in the
case of A and "only" 77.48% in the case of C.
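
As a reminder of how the indicator is computed, the
Python sketch below evaluates the local clustering
coefficient of a node as the fraction of pairs of its
neighbors that are themselves linked, and averages it
over nodes above a degree threshold. Whether the network
values quoted above are averages of local coefficients is
an assumption of this example, as are the function names
and the adjacency-set representation.

```python
from itertools import combinations


def local_clustering(adj, v):
    """Fraction of pairs of v's neighbours that are linked to each other."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)


def average_clustering(adj, min_degree=0):
    """Mean local coefficient over nodes with at least min_degree neighbours."""
    nodes = [v for v in adj if len(adj[v]) >= min_degree]
    if not nodes:
        return 0.0
    return sum(local_clustering(adj, v) for v in nodes) / len(nodes)


# A 4-clique {a, b, c, d} plus one stray link a - x.
adj = {
    "a": {"b", "c", "d", "x"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "x": {"a"},
}
print(local_clustering(adj, "a"), local_clustering(adj, "b"))
print(average_clustering(adj, min_degree=2))
```

In the toy graph a single stray link lowers the
coefficient of the node that carries it while leaving the
other clique members untouched, which is the effect that
makes a low coefficient a useful symptom of incoherence.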

Let us summarize the results so far: having ana- 
lyzed the distributions of component sizes, degrees, and 
clustering coefficients, we have found that the network 
of interlanguage links mainly consists of "near-cliques". 
There are rare connections between the cliques, which 
are usually symptoms of incoherence. Our next question 
is: what is the topology of these rare connections? 

We would like to extract the "skeleton" of an
incoherent component, a network in which each set of
nodes representing a given topic (usually a
"near-clique") is shrunk
to a single point, thus revealing the connections between 
separate topics. Of course, partitioning a network into 
topics requires expert knowledge, which we cannot pro- 
vide. Instead, we propose a very simple method of ex- 
tracting an approximate structure of the skeleton net- 
work: 

1. choose any of the most frequently occurring
languages; the nodes in this language will be reference
nodes;

2. for each node $v$: find the closest reference node(s)
$z(v)$;

3. while there exists a pair of connected nodes $v_1$, $v_2$
such that $z(v_1) = z(v_2)$ and $|z(v_1)| = |z(v_2)| = 1$:
merge $v_1$ and $v_2$;

4. while there exists a pair of connected nodes $v_1$, $v_2$
such that $|z(v_1)| > 1$ and $|z(v_2)| > 1$: merge $v_1$
and $v_2$;

5. while there exists a node $v$ such that $|z(v)| = 2$ and
$v$ is connected to exactly two other nodes: remove $v$
and directly connect the two other nodes.

FIG. 4: Skeleton of a medium-sized incoherent component 
(812 articles, including 47 in English). The skeleton 
extraction procedure is described in the text. 

Merging two nodes $v_1$ and $v_2$ means replacing them with
a new node $v$, connecting the new node with all the
neighbors of $v_1$ and $v_2$, and setting
$z(v) := z(v_1) \cup z(v_2)$.
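
The five steps can be sketched in Python as follows. This
is an illustrative reading of the procedure, not the
original implementation: the "language:Title" page format,
the function names, the order in which candidate pairs
are scanned, and the convention that a merged node keeps
the name of one of its constituents are all assumptions.
The repeated linear scans are kept for clarity and would
be too slow for the full network.

```python
def build_adj(links):
    """Undirected adjacency sets from (page, page) pairs; self-links skipped."""
    adj = {}
    for a, b in links:
        if a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def closest_references(adj, ref_lang):
    """Step 2: level-by-level BFS from all reference nodes (step 1)."""
    refs = [v for v in adj if v.startswith(ref_lang + ":")]
    dist = {r: 0 for r in refs}
    z = {r: {r} for r in refs}
    frontier = refs
    while frontier:
        next_frontier = []
        for u in frontier:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    z[w] = set()
                    next_frontier.append(w)
                if dist[w] == dist[u] + 1:
                    z[w] |= z[u]  # u lies on a shortest path to w
        frontier = next_frontier
    return {v: frozenset(z.get(v, ())) for v in adj}


def merge(adj, z, u, v):
    """Replace u and v by one node, kept under the name u."""
    for w in adj.pop(v):
        adj[w].discard(v)
        if w != u:
            adj[w].add(u)
            adj[u].add(w)
    z[u] = z[u] | z.pop(v)


def find_pair(adj, z, pred):
    """First connected pair (u, v) whose z-sets satisfy pred, else None."""
    for u in adj:
        for v in adj[u]:
            if pred(z[u], z[v]):
                return u, v
    return None


def same_single(zu, zv):   # condition of step 3
    return zu == zv and len(zu) == 1


def both_multiple(zu, zv):  # condition of step 4
    return len(zu) > 1 and len(zv) > 1


def skeleton(links, ref_lang="en"):
    adj = build_adj(links)
    z = closest_references(adj, ref_lang)
    for pred in (same_single, both_multiple):
        pair = find_pair(adj, z, pred)
        while pair is not None:
            merge(adj, z, *pair)
            pair = find_pair(adj, z, pred)
    while True:  # step 5: contract degree-2 nodes lying between two topics
        v = next((v for v in adj
                  if len(z[v]) == 2 and len(adj[v]) == 2), None)
        if v is None:
            break
        a, b = adj.pop(v)
        del z[v]
        adj[a].discard(v)
        adj[b].discard(v)
        adj[a].add(b)
        adj[b].add(a)
    return adj, z


links = [
    ("en:Tap (valve)", "it:Rubinetto"),
    ("it:Rubinetto", "es:Grifo"),
    ("en:Tap (valve)", "es:Grifo"),
    ("es:Grifo", "en:Griffin"),
    ("en:Griffin", "fr:Griffon"),
]
skeleton_adj, skeleton_z = skeleton(links)
print(len(skeleton_adj), sorted(sorted(s) for s in skeleton_z.values()))
```

In this toy component the ambiguous page es:Grifo is
equally close to both English pages, so it is contracted
in step 5 and the skeleton reduces to two topic nodes
joined by a single link.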

Figure 4 presents a result of the skeleton extraction
procedure applied to a medium-sized component. After
extracting the skeleton of the entire A network, the
average degree of a skeleton node is approx. 1.17, and the
clustering coefficient is approx. 37%. The distribution
of node degrees is shown in the right panel of Figure 2.
The distribution follows a power law (with
$\gamma \approx 3.75$), indicating a scale-free network.
The peak at degrees 27-30 is yet another result of mass
editing, this time related to articles on the days of the
year (such as en:December 1, en:December 2, etc.). In 10
out of 12 cases, at least one language edition contains a
bizarre copy-and-paste error that connects all the days
of a given month: for example, all the 30 articles on the
days of September from the Hindi edition contain an
interlanguage link to the article in Kannada on
September 11. Thus, in the skeleton network, the node
representing September 11 has 29 neighbors. Note that in
the ideal case (no incoherence) the skeleton network
should consist solely of isolated nodes, i.e., should
contain no links at all.

Summing up, we have presented the surprisingly
complex topology of the interlanguage links in Wikipedia.
Instead of a set of isolated cliques, the structure can be
informally described as a scale-free network of loosely
interconnected near-cliques. From a user's point of view,
lack of coherence results in semantic drift, e.g.
en:Pipeline and en:Vulture are connected by a series of
interlanguage links which are supposed to model
equivalence (cf. Figure 4).

The results of our research motivated us to create
a web service (http://wikitools.icm.edu.pl) where
Wikipedians may find a detailed analysis of each
incoherent component, together with relevant edit
recommendations. We advertised the web service on
Wikipedia's mailing list, which initiated a short
discussion [14]. Following the discussion, a Wikipedian
changed the wording of the definition of an interlanguage
link from "to the same subject" to "to one or more nearly
equivalent or exactly equivalent pages". However, the
topology described in this paper indicates serious
incoherence even under the slightly relaxed definition.